
    Spontananfragen auf Datenströmen (Ad-hoc Queries on Data Streams)

    Get PDF
    Many modern applications require processing large amounts of data in real time. As a result, distributed stream processing engines (SPEs) have gained significant attention as an important new class of big data processing systems. The central design principle of these SPEs is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. However, many real applications process not only long-running queries but also many short-running queries on data streams. In these applications, multiple stream queries are created and deleted concurrently, in an ad-hoc manner. The common practice for handling ad-hoc stream queries is to fork the input stream and allocate additional resources for each query. However, this approach leads to redundant computation and data copying. This thesis lays the foundation for efficient ad-hoc stream query processing. To bridge the gap between stream data processing and ad-hoc query processing, we follow a top-down approach. First, we propose a benchmarking framework to analyze state-of-the-art SPEs. We provide a definition of latency and throughput for stateful operators. Moreover, we carefully separate the system under test from the driver, to correctly represent the open-world model of typical stream processing deployments. This separation enables us to measure system performance under realistic conditions. Our solution is the first benchmarking framework to define and test the sustainable performance of SPEs. Through our analysis, we find that state-of-the-art SPEs are unable to execute stream queries in an ad-hoc manner. Second, we propose the first ad-hoc stream query processing engine for distributed data processing environments.
We develop our solution based on three main requirements: (1) Integration: ad-hoc query processing should be a composable layer that can extend stream operators such as join, aggregation, and window operators; (2) Consistency: ad-hoc query creation and deletion must be performed consistently, ensuring exactly-once semantics and correctness; (3) Performance: in contrast to modern SPEs, an ad-hoc SPE should maximize not only data throughput but also query throughput, via incremental computation and resource sharing. Third, we propose an ad-hoc stream join processing framework that integrates dynamic query processing and query re-optimization techniques with ad-hoc stream query processing. Our solution comprises an optimization layer and a stream data processing layer. The optimization layer periodically re-optimizes the query execution plan, performing join reordering and vertical and horizontal scaling at runtime without stopping execution. The data processing layer enables incremental and consistent query processing, supporting all the actions triggered by the optimizer. Together, the second and third contributions form a complete ad-hoc SPE. We use the first contribution not only to benchmark modern SPEs but also to evaluate our ad-hoc SPE.
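The contrast between forking the stream per query and sharing one operator across concurrently created and deleted ad-hoc queries can be made concrete with a toy sketch (hypothetical Python; `SharedWindowCount` and its methods are illustrative names, not the thesis's API):

```python
from collections import defaultdict

class SharedWindowCount:
    """One shared tumbling-window count serves many ad-hoc queries,
    instead of forking the stream and keeping per-query state."""
    def __init__(self, window_size):
        self.window_size = window_size
        self.counts = defaultdict(int)   # (window_id, key) -> count, computed once
        self.queries = {}                # query_id -> key predicate

    def add_query(self, query_id, predicate):      # ad-hoc query creation
        self.queries[query_id] = predicate

    def remove_query(self, query_id):              # ad-hoc query deletion
        del self.queries[query_id]

    def on_event(self, timestamp, key):
        window_id = timestamp // self.window_size
        self.counts[(window_id, key)] += 1         # shared computation

    def results(self):
        # Each query only filters the single shared state;
        # nothing is recomputed or copied per query.
        return {qid: {wk: c for wk, c in self.counts.items() if pred(wk[1])}
                for qid, pred in self.queries.items()}

op = SharedWindowCount(window_size=10)
op.add_query("q1", lambda k: k == "a")   # short-running ad-hoc query
op.add_query("q2", lambda k: True)       # long-running query
for ts, k in [(1, "a"), (2, "b"), (12, "a")]:
    op.on_event(ts, k)
print(op.results())
```

In the fork-per-query approach, each `add_query` would duplicate the input stream and the window state; here queries attach to one shared operator, which is the resource-sharing idea the abstract describes.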

    Benchmarking Distributed Stream Data Processing Systems

    Full text link
    The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, detailed analyses of the systems' performance characteristics are still missing. In this paper, we propose a framework for benchmarking distributed stream processing engines. We use our suite to evaluate in detail the performance of three widely used SDPSs, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations, the basic class of operations in stream analytics. For this benchmark, we design workloads based on real-life, industrial use cases inspired by the online gaming industry. The contribution of our work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we carefully separate the system under test from the driver, in order to correctly represent the open-world model of typical stream processing deployments and therefore measure system performance under realistic conditions. Third, we build the first benchmarking framework to define and test the sustainable performance of streaming systems. Our detailed evaluation highlights the individual characteristics and use cases of each system. (Published at ICDE 2018.)
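The idea of sustainable throughput, the largest ingest rate a system can absorb without its backlog growing without bound, can be illustrated with a toy queueing model (hypothetical Python; this is not part of the benchmark suite, which measures real systems):

```python
def sustainable_throughput(service_rate, rates, horizon=1000):
    """Toy model: an operator drains `service_rate` events per tick.
    An input rate is sustainable if the backlog stays bounded over the
    horizon; the sustainable throughput is the largest such rate."""
    best = 0
    for rate in rates:
        backlog = 0
        stable = True
        for _ in range(horizon):
            backlog = max(0, backlog + rate - service_rate)
            if backlog > 10 * service_rate:   # backlog grows without bound
                stable = False
                break
        if stable:
            best = max(best, rate)
    return best

# At 110 events/tick the backlog grows by 10 every tick and never recovers,
# so only rates up to the service rate of 100 are sustainable.
print(sustainable_throughput(service_rate=100, rates=[50, 90, 100, 110, 150]))
```

A benchmark driver that pushes past this point measures queueing delay rather than system latency, which is why the paper separates the driver from the system under test.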

    PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization

    Get PDF
    Big data analytics is a critical and unavoidable process in any business and industrial environment. Nowadays, companies that exploit the value in their big data generate more revenue than those that do not. Once companies have determined their big data strategy, they face another serious problem: designing and building an in-house scalable system that runs their business intelligence is difficult. The PROTEUS project aims to design, develop, and provide an open, ready-to-use big data software architecture that can handle extremely large historical data and data streams and supports online machine learning predictive analytics and real-time interactive visualization. The overall evaluation of PROTEUS is carried out using a real industrial scenario. PROTEUS is an EU Horizon 2020 research project whose goal is to investigate and develop ready-to-use, scalable online machine learning algorithms and real-time interactive visual analytics, with attention to scalability, usability, and effectiveness. In particular, PROTEUS aims to solve the following big data challenges by surpassing the current state-of-the-art technologies with original contributions: 1. handling extremely large historical data and data streams; 2. analytics on massive, high-rate, and complex data streams; 3. real-time interactive visual analytics of massive datasets, continuous unbounded streams, and learned models. PROTEUS's solutions for the challenges above are: 1) a real-time hybrid processing system built on top of Apache Flink (formerly Stratosphere [1]) with optimized relational algebra and linear algebra operations supported through the LARA declarative language. PROTEUS faces an additional challenge which deals with cor

    Büyük veri ile menü eniyilemesi

    No full text
    The use of optimal menu structuring for different customer profiles is essential for usability, efficiency, and customer satisfaction. Especially in competitive industries such as banking, having an optimal graphical user interface (GUI) is a must. Determining the optimal menu structure is generally accomplished through manual adjustment of the menu elements. However, such an approach is inherently flawed due to the overwhelming size of the optimization variables' search space. We propose a solution consisting of two phases: grouping users and finding optimal menus for the groups. In the first phase, we use H(EC)2S, a novel Hybrid Evolutionary Clustering algorithm with an Empty-Clustering Solution. In the second phase, we use a Mixed Integer Programming (MIP) framework to compute optimal menus. We evaluated the performance gains on a dataset of actual ATM usage logs. The results show that the proposed optimization approach provides a significant reduction in the average transaction completion time and the overall click count.

    High quality clustering of big data and solving empty-clustering problem with an evolutionary hybrid algorithm

    No full text
    3rd IEEE International Conference on Big Data (IEEE Big Data 2015; Santa Clara, United States). Achieving high-quality clustering is one of the most well-known problems in data mining. k-means is by far the most commonly used clustering algorithm. It converges fairly quickly, but a good solution is not guaranteed: the clustering quality depends heavily on the selection of the initial centroids. Moreover, as the number of clusters increases, the algorithm starts to suffer from "empty clustering". The motivation of this study is two-fold: we aim to improve k-means clustering quality while avoiding the empty-cluster issue. To this end, we developed a hybrid model, H(EC)2S, Hybrid Evolutionary Clustering with Empty Clustering Solution. First, it selects representative points to eliminate the empty-clustering problem; the hybrid algorithm then uses only these points during centroid selection. The proposed model combines a Fireworks and Cuckoo-search based evolutionary algorithm with centroid-calculation heuristics. The model is implemented as a Hadoop MapReduce algorithm to achieve scalability on big data clustering problems. The advantages of the developed model are particularly attractive as the volume, dimensionality, and number of clusters increase. The results indicate that considerable clustering-quality improvement is achieved using the proposed model.
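The empty-cluster failure mode, and one simple guard against it, can be sketched in a few lines (illustrative Python on 1-D points; H(EC)2S itself uses representative-point selection plus an evolutionary search, which this toy does not reproduce):

```python
import random

def kmeans_no_empty(points, k, iters=20, seed=0):
    """Plain k-means with a simple guard against the empty-cluster problem:
    a cluster that receives no points is reseeded with the point farthest
    from all current centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[nearest].append(p)
        # update step, with empty-cluster reseeding
        for i in range(k):
            if clusters[i]:
                centroids[i] = sum(clusters[i]) / len(clusters[i])
            else:
                centroids[i] = max(points,
                                   key=lambda p: min((p - c) ** 2 for c in centroids))
    return sorted(centroids)

print(kmeans_no_empty([0.0, 0.1, 0.2, 10.0, 10.1], k=2))
```

On two well-separated groups this converges to one centroid per group regardless of the initial sample; without the reseeding branch, an unlucky initialization with larger `k` can leave clusters permanently empty.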

    k-means Performance Improvements with Centroid Calculation Heuristics both for Serial and Parallel environments

    No full text
    4th IEEE International Congress on Big Data (BigData Congress 2015; New York City, United States). k-means is the most widely used clustering algorithm due to its fairly straightforward implementation for various problems. Meanwhile, as the number of clusters increases, the number of iterations also tends to increase. However, there are still opportunities for improvement, as several studies in the literature indicate. In this study, we propose improved implementations of the k-means algorithm with a centroid-calculation heuristic that yields a performance improvement over traditional k-means. Two versions of the algorithm are configured for different data sizes, one for small-data and the other for big-data implementations. Both the serial and the MapReduce-parallel implementations of the proposed algorithm are tested and analyzed using two different data sets with varying numbers of clusters. The results show that the big-data implementation outperforms the other compared methods beyond a certain threshold, and the small-data implementation performs better with increasing k.
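A single k-means iteration in the MapReduce formulation behind the parallel variant looks roughly like this (illustrative Python on 1-D points; the paper's centroid-calculation heuristics are not reproduced here):

```python
from collections import defaultdict

def kmeans_step_mapreduce(points, centroids):
    """One k-means iteration in MapReduce style: the map phase emits
    (nearest-centroid-id, (point, 1)) pairs, and the reduce phase sums
    the partial results per centroid to produce the new centroids."""
    # map: assign each point to its nearest centroid
    emitted = [(min(range(len(centroids)),
                    key=lambda i: (p - centroids[i]) ** 2), (p, 1))
               for p in points]
    # shuffle + reduce: sum coordinates and counts per centroid id
    sums = defaultdict(lambda: [0.0, 0])
    for cid, (p, n) in emitted:
        sums[cid][0] += p
        sums[cid][1] += n
    return {cid: s / n for cid, (s, n) in sums.items()}

print(kmeans_step_mapreduce([1.0, 2.0, 9.0], [0.0, 10.0]))
```

Because the reduce phase only needs per-cluster sums and counts, the map output can be pre-aggregated with a combiner on each node, which is what makes the MapReduce variant pay off once the data no longer fits one machine.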

    About some measures to be taken to improve pre-service and in-service educations in Azerbaijan (Organizational approach)

    Get PDF
    In this article the authors share their opinions on measures that should be taken to strengthen pre-service and in-service education in Azerbaijan.

    Generic Menu Optimization for Multi-profile Customer Systems

    No full text
    1st IEEE International Symposium on Systems Engineering (2015; Italy). The use of optimal ATM menu structuring for different customer profiles is essential for usability, efficiency, and customer satisfaction. Especially in competitive industries such as banking, having an optimal user interface (UI) is a must. Determining the optimal menu structure is generally accomplished through manual adjustment of the menu elements. However, such an approach is inherently flawed due to the overwhelming size of the optimization variables' search space. Previous studies on menu optimization are either based on customer questionnaires or made for only a specific menu type using heuristic approaches (i.e., not generic). In this paper, we propose a systematic optimization method for the menu-structuring problem through a novel Mixed Integer Programming (MIP) framework. Our optimization approach is not specific to a predetermined menu class; on the contrary, the MIP model is designed as a generic optimization framework that can be applied to a wide range of menu optimization problems. We evaluated the performance gains on a dataset of actual ATM usage logs covering a period of 18 months and 40 million transactions. We validated our results both with a simulation application and by mining the existing data logs. The results show that the proposed optimization approach provides a significant reduction in the average transaction completion time and the overall click count.
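The usage-weighted objective behind menu structuring can be shown with a toy brute-force version (hypothetical Python; the paper's MIP model covers hierarchical, multi-profile menus that this flat sketch does not):

```python
from itertools import permutations

def best_menu(freqs, slots_per_page):
    """Toy version of the menu-structuring objective: order functions
    across sequential menu pages so that the usage-weighted cost of
    reaching each function is minimal. `freqs` maps function -> usage
    count; reaching slot i costs one page step per page plus one click."""
    items = list(freqs)

    def cost(order):
        return sum(freqs[f] * (i // slots_per_page + 1)
                   for i, f in enumerate(order))

    # brute force over all orderings; a MIP solver replaces this at scale
    return min(permutations(items), key=cost)

print(best_menu({"withdraw": 60, "balance": 30, "transfer": 10},
                slots_per_page=2))
```

Brute force is exponential in the number of menu items, which is exactly the "overwhelming search space" argument: a generic MIP formulation lets an off-the-shelf solver explore that space instead of manual adjustment.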

    Benchmarking Synchronous and Asynchronous Stream Processing Systems

    No full text
    Processing high-throughput data streams has become a major challenge in areas such as real-time event monitoring, complex dataflow processing, and big data analytics. While there has been tremendous progress in distributed stream processing systems in the past few years, the high-throughput and low-latency (a.k.a. high sustainable-throughput) requirements of modern applications push the limits of traditional data processing infrastructures. To understand the upper bound of the maximum sustainable throughput possible for a given node configuration, we designed multiple hard-coded multi-threaded processes (called ad-hoc dataflows) in C++ using the Message Passing Interface (MPI) and Pthread libraries. Our preliminary results show that our ad-hoc design is on average 5.2 times faster than Flink and 9.3 times faster than Spark.
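The "ad-hoc dataflow" idea, a fixed pipeline of threads wired together with explicit queues instead of a general-purpose scheduler, can be mirrored in a short sketch (the paper's implementation is C++ with MPI/Pthreads; this Python analogue only shows the structure):

```python
import queue
import threading
import time

def run_pipeline(n_events):
    """A hard-coded producer -> operator -> sink pipeline of threads with
    explicit bounded queues and sentinel-based shutdown, measuring the
    end-to-end event rate."""
    q1 = queue.Queue(maxsize=1024)
    q2 = queue.Queue(maxsize=1024)
    results = []

    def produce():
        for i in range(n_events):
            q1.put(i)
        q1.put(None)                      # end-of-stream sentinel

    def operate():
        while (item := q1.get()) is not None:
            q2.put(item * 2)              # a trivial map operator
        q2.put(None)

    def sink():
        while (item := q2.get()) is not None:
            results.append(item)

    start = time.perf_counter()
    threads = [threading.Thread(target=t) for t in (produce, operate, sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return len(results), len(results) / elapsed   # (events, events/sec)

print(run_pipeline(10000))
```

There is no dynamic scheduling, serialization, or fault tolerance here; stripping those layers away is what makes such hard-coded dataflows a useful upper bound on sustainable throughput for a node.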